Abstract

Several popular graph embedding techniques for
representation learning and dimensionality reduction
rely on performing computationally expensive
eigendecompositions to derive a nonlinear transformation
of the input data space. The resulting eigenvectors
encode the embedding coordinates for the training
samples only, and so the embedding of novel data
samples requires further costly computation. In this
paper, we present a method for the out-of-sample
extension of graph embeddings using deep neural
networks (DNNs) to parametrically approximate these
nonlinear maps. Compared with traditional
nonparametric out-of-sample extension methods, we
demonstrate that the DNNs can generalize with equal or
better fidelity and require orders of magnitude less
computation at test time. Moreover, we find that
unsupervised pretraining of the DNNs improves
optimization for larger network sizes, thus removing
sensitivity to model selection.


1 Introduction 

Manifold learning is a popular data analysis
framework that attempts to recover compact
low-dimensional embeddings of high-dimensional
datasets. Several manifold learning algorithms,
including Isomap (Tenenbaum et al. 2000), locally
linear embedding (Roweis and Saul 2000, 2003),
diffusion maps (Coifman and Lafon 2006), and
Laplacian eigenmaps (Belkin and Niyogi 2003),
derive coordinate representations that encode the
local neighborhood structure of an unlabeled data
sample. These techniques have found considerable
success in a wide array of application domains,
including computer vision (Elgammal and Lee 2004;
Murphy-Chutorian and Trivedi 2009; He et al. 2005),
speech processing (Jansen and Niyogi 2006; Jansen
et al. 2012; Jansen and Niyogi 2013; Tomar and Rose
2013; Sahraeian and Van Compernolle 2013; Mousazadeh
and Cohen 2013), and natural language processing
(Shi et al. 2007; Solka et al. 2008). In (Yan et al.
2007), it was shown that these algorithms are all
members of a more general graph embedding framework,
in which the transformations are derived via a
generalized eigendecomposition of the graph Laplacian
matrix operator for algorithm-specific graph
construction methodologies.

In their basic form, these graph embedding
techniques only provide transformations of the training
samples used to construct the graph. Thus, even if
a large training set is used, computing the output
of the estimated map for a novel test sample is not
possible. To address this shortcoming, a
nonparametric out-of-sample extension technique based on
Nystrom sampling was developed that leverages the
input and target representation pairs for each
training sample to approximate what the map would have
produced for a novel sample (Bengio et al. 2004;
Kumar et al. 2012). However, the Nystrom extension is
a kernel-based method with time complexity that
scales linearly with the number of training samples.
This increase in computational cost is especially
problematic because manifold methods are most
effective when provided the benefit of large training
sets for representation learning. It would be highly
beneficial to remove this trade-off between
representation quality and extension feasibility
with a more efficiently scalable method for
out-of-sample extension.

Neural networks have long been known to be a
powerful learning framework for classification and
regression, capable of distilling large training sets
into efficiently evaluated parametric models, and thus
are a natural choice for modeling manifold
embeddings. In their seminal paper, Hornik et al. (1989)
proved that feedforward neural networks can
approximate a virtually arbitrary deterministic map
between high-dimensional spaces, indicating that they
would also be ideally suited for our out-of-sample
extension problem. However, there are two caveats for
the use of neural networks as universal
approximators: (i) there must be sufficient hidden units
(i.e., sufficient model parameters), which in turn
require additional data samples for training without
overfitting; and (ii) the non-convexity of the
objective function grows with the number of model
parameters, making the search for reliable global
solutions increasingly difficult.

With these considerations in mind, we explore the
application of recent advances in deep neural network
(DNN) training methodology to the out-of-sample
extension problem. First, by stabilizing the Lanczos
eigendecomposition algorithm, we are able to
produce exact graph embeddings for training sets with
millions of data samples. This permits an extensive
study with deeper architectures than have been
previously considered for the task. Second, motivated by
success in the supervised classification setting
(Bengio 2009; Bengio et al. 2007), we consider
unsupervised DNN pretraining procedures to improve
optimization as our larger training samples support
commensurate increases in model complexity.

In the work that follows, we compare the
performance of our parametric DNN approach against a
Nystrom sampling baseline, both in terms of
approximation fidelity and test runtime. We find DNNs
to match or outperform the approximation fidelity
of the Nystrom method for all training sample sizes.
Furthermore, since the DNN approach is parametric,
its test-time complexity for fixed network size is
constant in the training sample size, producing
orders-of-magnitude speedup over Nystrom sampling for
larger training sizes. The remainder of this paper is
organized as follows. We begin with an overview of prior
work in out-of-sample extension for graph
embeddings. We then describe the strategy for stabilizing
eigendecompositions for large training sets, followed
by a description of the process for training our DNN
out-of-sample extension to approximate the
embedding for unseen data. Finally, we analyze the
reconstruction accuracy and computation speed of both
the Nystrom baseline and the DNN approach.


2 Prior Work 


The most popular methods for extending graph
embeddings to unseen data have been based on Nystrom
sampling (Bengio et al. 2004; Kumar et al. 2012),
and thus they will serve as the baseline in our
experiments. This is a nonparametric, kernel-based
technique that approximates the embedding of each test
sample by computing a weighted interpolation of the
embeddings for training samples that were nearby
in the original input space. Formally (see (Bengio
et al. 2004) for details), let $X = \{x_1, \ldots, x_n\}$
be the set of graph embedding training samples, where
each $x_i \in \mathbb{R}^d$. Let $\mathcal{L}$ be the
symmetric, normalized graph Laplacian operator defined
for the set $X$ such that
$\mathcal{L} = I - D^{-1/2} A D^{-1/2}$, where
$A_{ij} = K(x_i, x_j)$ for some positive semidefinite
kernel function $K$ that must be specialized for each
specific graph embedding algorithm; and $D$ is the
diagonal matrix defined via $D_{ii} = \sum_j A_{ij}$.
Let the spectral decomposition of the normalized
Laplacian be denoted as $\mathcal{L} = U \Sigma U^T$,
where the diagonal entries of $\Sigma$ are non-increasing.
The $d'$-dimensional embedding of $X$ is then provided
by the first $d'$ columns of $U$, which we shall denote
as $U_{d'}$. Stated simply, the embedding of $x_i$ is
given by the $i$-th row of $U_{d'}$.


To embed an out-of-sample data point $x \in \mathbb{R}^d$
via the Nystrom extension, the $p$-th dimension of the
extension $y_p(x)$ is given by

$y_p(x) = \ldots$    (1)


We see that the complexity of this extension is
linear in the size of the training set. In practice,
approximate nearest neighbor techniques can be used
to speed this up with minimal loss in fidelity (the
implementation we benchmark uses a k-d tree for this
purpose), but the algorithmic complexity still increases
with training size. Finally, note that a nearly
equivalent formulation based on reproducing kernel
Hilbert space theory was presented in (Belkin et al.
2006), where the kernelization was introduced into the
objective function before the eigendecomposition is
performed. This formulation has the same scalability
limitations as the Nystrom extension. These
computational difficulties motivate our exploration of DNNs
to model embeddings for out-of-sample extension.

Traditional neural networks have also been
considered for out-of-sample extension in the past in two
limited studies involving small datasets and model
architectures (Gong et al. 2006; Chin et al. 2007).
The idea was introduced in (Gong et al. 2006), but
the study failed to include a meaningful quantitative
evaluation. The experiments in (Chin et al. 2007),
which predated the advent of recent deep learning
training methodologies, found neural networks to be
one of the worst performing methods. However, with
a similar motive of computational efficiency, (Gregor
and LeCun 2010) explored the use of DNNs for
approximating expensive sparse coding transformations
and produced more compelling results.


3 A Scalable Out-of-Sample Extension

A truly scalable out-of-sample extension must
simultaneously consume a large amount of training
data for detailed modeling and provide a test-time
complexity that does not strongly depend on that
training set size. The nonparametric nature of the
Nystrom method leads to a linear dependence on the
training set size (logarithmic if kernel approximations
are implemented) and thus can get bogged down as
we feed more data to the graph embedding training.
We begin this section with a simple trick for the
eigendecomposition of large graph Laplacians, which
permits larger training sets and motivates the need
for more computationally efficient extension
methods. This is followed by a presentation of the deep
neural network architecture we propose to efficiently
extend the embedding to arbitrary test points.


3.1 Stabilizing the Eigendecomposition

In (Golub and Van Loan 2012), it is suggested that
the stability of the Lanczos eigendecomposition
algorithm can be greatly increased (and memory
requirements consequently reduced) by reformulating
the eigenproblem to recover the largest eigenvalues.
We can exploit this by observing that if $v$ is an
eigenvector of $\mathcal{L}$ with eigenvalue $\lambda$,
then $v$ is also an eigenvector of
$\tilde{\mathcal{L}} = I - \mathcal{L}$ with eigenvalue
$1 - \lambda$ (which is guaranteed to be less than or
equal to 1). Thus, with this small redefinition of the
eigenproblem, we can recover the same eigenvectors by
considering the largest eigenvalues of
$\tilde{\mathcal{L}}$. With the ARPACK implementation,
a similar effect can also be accomplished by searching
for the smallest algebraic eigenvalues of $\mathcal{L}$
directly.
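The spectral shift can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it uses dense NumPy routines rather than a Lanczos solver, and the 6-node ring graph and all variable names are hypothetical. For large sparse Laplacians, SciPy's ARPACK wrapper `scipy.sparse.linalg.eigsh` accepts `which='LA'` and `which='SA'` to request the largest or smallest algebraic eigenvalues.

```python
import numpy as np

# Toy undirected graph (hypothetical): a 6-node ring with binary edge weights.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

# Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(deg)
L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Shifted operator L_tilde = I - L shares its eigenvectors with L.
L_tilde = np.eye(n) - L

lam, V = np.linalg.eigh(L)                # eigenvalues of L, ascending
lam_tilde = np.linalg.eigvalsh(L_tilde)   # eigenvalues of L_tilde, ascending

# The largest eigenvalues of L_tilde are 1 minus the smallest eigenvalues of L.
shift_ok = np.allclose(lam_tilde[::-1], 1.0 - lam)

# Every eigenvector v of L satisfies L_tilde v = (1 - lambda) v.
residual = np.abs(L_tilde @ V - V * (1.0 - lam)).max()
print(shift_ok, residual < 1e-10)
```

The same eigenvectors are recovered either way; only the end of the spectrum that the solver targets changes.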

While this trick is by no means a fundamental
theoretical innovation on our part, its effects have
proven dramatic. Our past efforts to solve for the
smallest magnitude eigenvalues of the graph Laplacian
exceeded our hardware memory limits when our
graphs reached the order of 100,000 nodes and 1
million edges. Employing this simple trick, we have now
succeeded in processing graphs with order 100
million nodes and order 10 billion edges on conventional
hardware, stably solving for the top 100 eigenvectors
in a few days using 32 cores and 0.5 TB of RAM. This
problem size even exceeds what was reported using
approximate singular value decomposition solvers in
the past (Talwalkar et al. 2008). For the 1.5 million
node graphs we consider in our experiments described
below, this method was more than adequate for our
(offline) embedding training needs.


3.2 Deep Neural Network Methodology

Solving the eigenvalue problem produces a
$d'$-dimensional (exact) embedding
$z_i \in \mathbb{R}^{d'}$ for each $x_i \in X$. Rather
than viewing new data points as out-of-sample points
whose mapping is estimated with interpolation, we
instead seek to estimate the mapping (from $x$ to $z$)
itself. To this end, we now consider feedforward
neural architectures with $N$ hidden layers, each
containing $M$ hidden units. The $l$-th hidden layer
nonlinearly maps the output of the previous layer
$h_{l-1}$ to a new hidden representation
$h_l \in \mathbb{R}^M$ according to

$h_l = \sigma(W_l h_{l-1} + b_l)$,    (2)

where each $W_l$ is a parameter matrix, $b_l$ is a bias
column vector, and $\sigma$ is the activation function
(which we set to tanh in our experiments). The input
$h_0$ to the first layer is a point $x$ in our input
space $\mathbb{R}^d$, while
$W_1 \in \mathbb{R}^{M \times d}$,
$W_l \in \mathbb{R}^{M \times M}$ for
$l \in \{2, \ldots, N\}$, and $b_l \in \mathbb{R}^M$ for
$l \in \{1, \ldots, N\}$. The output $h_N$ of these $N$
hidden layers is finally transformed into a
corresponding point $y(x) \in \mathbb{R}^{d'}$
according to


$y(x) = W_{N+1} h_N + b_{N+1}$,    (3)

where $W_{N+1} \in \mathbb{R}^{d' \times M}$ and
$b_{N+1} \in \mathbb{R}^{d'}$ are the decoding weight
matrix and bias column vector, respectively. The
training objective is to solve for parameters
$\Theta^* = \{W_1^*, \ldots, W_{N+1}^*, b_1^*, \ldots, b_{N+1}^*\}$
that minimize mean squared error between the exact and
predicted embedding pairs:

$\Theta^* = \arg\min_{\Theta} \frac{1}{n} \sum_{i=1}^{n} \| z_i - y(x_i) \|^2$.    (4)

This is generally accomplished with backpropagation
and stochastic gradient descent optimization.

It is critical that the training procedure safeguards
against overfitting to the training sample, especially
as deeper architectures are required in order to
approximate the detailed graph embeddings we wish to
extend. Our training procedure, which was also
employed in (Kamper et al. 2015) for a different
application, has two steps: (i) unsupervised stacked
autoencoder pretraining that uses $X$ only, and (ii)
supervised fine-tuning using the $(x_i, z_i)$ pairs as
training inputs and targets.


3.2.1 Unsupervised Pretraining

When the input and output targets are the same, our
deep network architecture reduces to a stacked
autoencoder (SAE) with $N$ encoding layers and one
decoding layer. Thus, to initialize model parameters of
the DNN to approximately recover the identity
mapping, we consider the unsupervised SAE pretraining
procedure (Bengio 2009; Bengio et al. 2007). Here
we introduce one hidden layer at a time, performing
several epochs of stochastic gradient descent to
minimize mean squared error between our training
samples and themselves at each intermediate network
depth. As we add each new layer, we discard the
linear decoding weights from the previous optimization,
use the previous hidden representation as input to
the new hidden layer, and reoptimize all layer
parameters. Early stopping is used to prevent exact
recovery of the identity map for each layer.


3.2.2 Supervised Fine-Tuning


Using the above layer-wise pretraining procedure, we
now have initialized all parameters in the network.
It only remains to perform several epochs of
stochastic gradient descent to reoptimize network
parameters to minimize mean squared error between the
exact and predicted embedding pairs according to
Equation (4). After this training is complete, the
DNN-based out-of-sample extension $y(x)$ for an
arbitrary point $x \in \mathbb{R}^d$ can be efficiently
computed using the standard neural network forward
pass defined by Equations (2) and (3). This amounts to
$N + 1$ matrix-vector multiplies and vector additions,
plus an evaluation of $\sigma$ for each hidden unit.
For a fixed network architecture, this computation is
constant in the number of training samples.
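To make the test-time cost concrete, the following NumPy sketch implements the forward pass of Equations (2) and (3) and the objective of Equation (4). It is a minimal illustration, not the Pylearn2 setup used in the experiments: the function names, the random initialization (standing in for pretrained and fine-tuned parameters), and the toy input are all assumptions.

```python
import numpy as np

def init_params(d, d_out, n_hidden_layers, m_units, seed=0):
    """Random weights; a hypothetical stand-in for trained parameters."""
    rng = np.random.default_rng(seed)
    sizes = [d] + [m_units] * n_hidden_layers + [d_out]
    return [(rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """N tanh hidden layers (Eq. 2) followed by a linear decoder (Eq. 3)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)       # h_l = sigma(W_l h_{l-1} + b_l)
    W_out, b_out = params[-1]
    return W_out @ h + b_out         # y(x) = W_{N+1} h_N + b_{N+1}

def mse_objective(params, X, Z):
    """Eq. 4: mean over samples of the squared error ||z_i - y(x_i)||^2."""
    preds = np.stack([forward(params, x) for x in X])
    return np.mean(np.sum((Z - preds) ** 2, axis=1))

# Example: the "large" architecture (N = 5, M = 140) with d = d' = 40.
params = init_params(d=40, d_out=40, n_hidden_layers=5, m_units=140)
y = forward(params, np.ones(40))
print(y.shape, len(params))          # len(params) = N + 1 matrix-vector products
```

Nothing in `forward` depends on the number of training samples, which is the property the paragraph above relies on.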


4 Experiments 

We choose speech as the application domain for our
scalability study for two reasons: (i) a single hour
of audio recordings typically produces 360,000
high-dimensional data samples known as frames, and so
large datasets are readily available; and (ii)
manifold learning techniques have been shown to learn
representations effective for keyword discovery and
search (Jansen et al. 2012; Jansen and Niyogi 2013;
Jansen et al. 2013). We use the TIMIT corpus for
evaluation, given the past success of manifold
embeddings for speech recognition on that data (Jansen
et al. 2012). It consists of over 4 hours of prompted
American English speech recordings, and is split into
a training set consisting of roughly 1.1 million data
samples (~3 hours) and a test set of roughly 400,000
data samples (~1 hour). There is no speaker overlap
between the two sets.

Figure 1: Heat maps showing the NRMSE values for all
Nystrom (a-d) and DNN (e-h) hyperparameter settings,
for each of the four training set sizes. The same
color scale is used for all images.


4.1 Evaluation Methodology 

Our goal is to measure the approximation fidelity of
the out-of-sample extension of the test set against a
reference embedding that is independent of the
extension method. Thus, our strategy is to perform an
exact graph embedding of the entire corpus (train+test)
using the method described in Section 3.1; we call
these the reference embeddings. We define each
out-of-sample extension using input frames and
corresponding reference embeddings from the training set.
This allows us to use reference embeddings for the
training set to approximate embeddings for the test
set for comparison against the true reference
embeddings of the test set. We measure approximation
fidelity in terms of normalized root mean squared
error (NRMSE) between the predicted and reference
test embeddings; here, the normalization constant is
taken to be the root mean squared error between the
test set reference embeddings and a random
permutation of those same samples. Thus, a perfect
out-of-sample extension will have NRMSE of 0, while an
extension that is a random mapping with the same
empirical output distribution will have NRMSE of 1.
In addition to defining the extensions with the entire
training set (1.1M samples), we also consider the
utility of random subsets of sizes 1,000, 10,000, and
100,000. However, we use the same reference
embeddings for all experiments.
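The NRMSE measure described above can be sketched in a few lines. This is an illustrative reading of the definition rather than the evaluation code used for the reported numbers: the function name, the per-sample squared-norm convention inside the RMSE, and the synthetic data are assumptions.

```python
import numpy as np

def nrmse(predicted, reference, seed=0):
    """RMSE between predictions and references, divided by the RMSE
    between the references and a random permutation of themselves."""
    rng = np.random.default_rng(seed)

    def rmse(a, b):
        return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

    shuffled = reference[rng.permutation(len(reference))]
    return rmse(predicted, reference) / rmse(shuffled, reference)

# Toy check with synthetic "reference embeddings" (hypothetical data).
rng = np.random.default_rng(1)
Z_ref = rng.normal(size=(5000, 40))
perfect = nrmse(Z_ref, Z_ref)
random_map = nrmse(Z_ref[rng.permutation(len(Z_ref))], Z_ref)
print(perfect, round(random_map, 2))
```

A perfect extension scores exactly 0, and a random reassignment of the reference embeddings scores close to 1, matching the two anchor points given in the text.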

Our input features are 40-dimensional,
homomorphically smoothed, mel-scale, log power
spectrograms (40 equally spaced mel bands from 0-8 kHz),
sampled every 10 ms using a 25 ms Hamming window. We
construct the graph Laplacian using a symmetrized
10-nearest neighbor adjacency graph with cosine
distance as the metric and binary edge weights. This
amounts to a Laplacian eigenmaps specialization of
the graph embedding framework. Finally, we keep
the 40 eigenvectors with the largest eigenvalues to
produce a graph embedding with the same dimension
as the input space. While our present focus is on
out-of-sample extension fidelity, it is relevant to note
that the reference embeddings match the best
downstream performance reported in (Jansen and Niyogi
2013), which used an identical embedding strategy.
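As a small-scale illustration of this graph construction, the sketch below builds a symmetrized k-nearest-neighbor graph with cosine distance and binary weights, then forms the normalized Laplacian. Several details are assumptions: the dense O(n^2) distance computation is only viable for toy data, the symmetrization keeps an edge if either endpoint lists the other as a neighbor, and the function name and random inputs are hypothetical.

```python
import numpy as np

def knn_normalized_laplacian(X, k=10):
    """Binary, symmetrized k-NN graph under cosine distance, and its
    normalized Laplacian L = I - D^{-1/2} A D^{-1/2}. Rows of X must be nonzero."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - Xn @ Xn.T                    # cosine distance matrix
    np.fill_diagonal(dist, np.inf)            # exclude self-matches
    nbrs = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbors per row
    n = X.shape[0]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    A = np.maximum(A, A.T)                    # assumed rule: edge if either direction
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return A, L

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))                # stand-in for 40-dim speech frames
A, L = knn_normalized_laplacian(X, k=10)
eigvals, eigvecs = np.linalg.eigh(L)          # ascending eigenvalues of L
embedding = eigvecs[:, :3]                    # smallest of L = largest of I - L
print(A.shape, embedding.shape)
```

At the scale used in the experiments, a sparse adjacency matrix and an iterative eigensolver would replace the dense routines used here.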


For the baseline Nystrom method, we compute
Equation (1) using a nearest neighbor approximation
to a radial basis function (RBF) kernel. This
approximation is facilitated by preprocessing the
training samples into a k-d tree data structure
(as implemented in scikit-learn) for efficient
retrieval of near-neighbor sets. Note that we tried
matching the Nystrom kernel to that used in the
graph construction (i.e., using binary weights) as
prescribed by (Bengio et al. 2004), but it
performed substantially worse than introducing RBF
weights. We consider kernel squared-bandwidths
$\sigma^2 \in \{0.005, 0.025, 0.05, 0.25, 0.5, 2.5\}$
and number of neighbors
$k \in \{1, 5, 10, 25, 50, 100, 500\}$.
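The role of the two hyperparameters can be illustrated with a simplified nearest-neighbor kernel interpolation. This sketch is not Equation (1): it is a plain normalized RBF-weighted average of the embeddings of the k nearest training samples, and it omits any eigenvalue scaling or kernel normalization the Nystrom formula may include. The Euclidean metric, the kernel form exp(-dist^2 / sigma^2), the brute-force neighbor search (in place of a k-d tree), and all names are assumptions.

```python
import numpy as np

def knn_rbf_extension(X_train, Z_train, X_test, k=10, sigma2=0.05):
    """RBF-weighted average of the embeddings of the k nearest training points."""
    # Squared Euclidean distances via |t|^2 + |x|^2 - 2 t.x, clipped at zero.
    d2 = (np.sum(X_test ** 2, axis=1)[:, None]
          + np.sum(X_train ** 2, axis=1)[None, :]
          - 2.0 * X_test @ X_train.T)
    d2 = np.maximum(d2, 0.0)
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]   # k nearest (unordered)
    d2_k = np.take_along_axis(d2, idx, axis=1)
    # Subtracting the row minimum avoids underflow; it cancels on normalization.
    w = np.exp(-(d2_k - d2_k.min(axis=1, keepdims=True)) / sigma2)
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("tk,tkd->td", w, Z_train[idx])

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 40))
Z_train = rng.normal(size=(100, 40))      # stand-in for reference embeddings
Z_hat = knn_rbf_extension(X_train, Z_train, X_train[:5], k=1)
print(Z_hat.shape)
```

Each test point requires a neighbor search over the training set, so the cost of this kind of extension grows with the number of training samples.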

For our DNN method, we consider
$N \in \{1, 2, \ldots, 9\}$ hidden layers, and for each
depth we evaluate layer sizes of
$M \in \{20, 40, 60, 80, 100, 120, 140\}$ hidden
units. For pretraining, we use the entire training
set and optimize each layer for 15 epochs of
stochastic gradient descent. Following the
prescription in (Kamper et al. 2015), we use a learning
rate of 2.5 X10“^ and minibatches of 256 samples. For
supervised fine-tuning, we increase the number of
epochs for the smaller training sets such that the
total number of examples processed is roughly fixed
(5 epochs for the full train set of 1.1 million
samples, 50 epochs for 100,000 samples, etc.) to ensure
adequate convergence. Also following (Kamper et al.
2015), for fine-tuning we use a learning rate of
4 x 10“^, but found a smaller minibatch of size 50
improved convergence. We use the Pylearn2 toolkit
(Goodfellow et al. 2013) for all DNN experiments.
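For a sense of scale across this grid, the parameter count of each architecture follows directly from the layer shapes in Section 3.2. The helper below is a hypothetical convenience function; the only inputs taken from the text are the 40-dimensional input and output spaces and the grid of depths and widths.

```python
def dnn_param_count(d_in, d_out, n_layers, m_units):
    """Weights plus biases for n_layers hidden layers of m_units and a linear decoder."""
    first = m_units * d_in + m_units                         # W_1, b_1
    hidden = (n_layers - 1) * (m_units * m_units + m_units)  # W_l, b_l for l >= 2
    decoder = d_out * m_units + d_out                        # W_{N+1}, b_{N+1}
    return first + hidden + decoder

small = dnn_param_count(40, 40, n_layers=2, m_units=60)
large = dnn_param_count(40, 40, n_layers=5, m_units=140)

# Every (N, M) pair in the hyperparameter grid described above.
grid = {(n, m): dnn_param_count(40, 40, n, m)
        for n in range(1, 10) for m in range(20, 141, 20)}
print(small, large, len(grid))   # 8560 90340 63
```

The two named configurations differ by roughly a factor of ten in parameters, and the grid spans 63 architectures in total.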


4.2 Hyperparameter Sensitivity 

First, we consider the NRMSE performance as we
vary the number of training samples used by the
out-of-sample extension. We drew random subsets of the
1.1 million sample training set of sizes 1,000, 10,000,
and 100,000. Each of these subsets was used for
computation of Equation (1) in the Nystrom method and
for supervised fine-tuning in the case of the DNN
method. Table 1 lists for each training subset size the
NRMSE and test runtime for (i) Nystrom using the
optimal set of hyperparameters, (ii) a smaller DNN
with 2 layers of 60 units each, and (iii) a larger DNN
with 5 layers of 140 units. We see that the larger
DNN matches or outperforms the Nystrom method
for all training set sizes, demonstrating the power of
deep learning for accurately approximating complex
nonlinear functions. While the small DNN matches
Nystrom for the 2 smaller training sets, it does not
have sufficient parameters to keep pace as more
training data becomes available.

These results emphasize the importance of each
method's sensitivity to the choice of hyperparameters
that specify the complexity of the out-of-sample
extension function. This is especially true for fully
unsupervised representation learning settings, where
cross-validation may not be possible. To explore this
consideration, for the Nystrom method, we vary the
number of (approximate) nearest neighbors that
contribute to each test sample, as well as the kernel
bandwidth. For the proposed DNN method we vary both
the number of layers ($N$) and the number of hidden
units per layer ($M$).

Figure 1 shows heatmaps indicating the NRMSE
for all hyperparameters considered for the two
methods for the four training set sizes. For the Nystrom
method, performance is relatively stable across the
range of kernel bandwidths, but it is more sensitive
to the number of neighbors used in the computation.
Moreover, the optimal number of neighbors increases
as more training data is available, necessitating some
amount of parameter tuning to achieve optimal
approximation.

Table 1: Test set NRMSE and runtime (in seconds,
averaged over several trials) for Nystrom with optimal
hyperparameters, a small DNN (N = 2, M = 60), and a
large DNN (N = 5, M = 140).

Train    Nystrom (optimal)    Small DNN         Large DNN
Size     NRMSE    Time        NRMSE    Time     NRMSE    Time
1k       0.36     25          0.33     5.4      0.28     33
10k      0.29     240         0.29     5.4      0.24     33
100k     0.25     2,200       0.29     5.4      0.23     33
1.1M     0.24     12,000      0.29     5.4      0.23     33

The DNN approach reaches near-optimal
performance for all training set sizes provided we include
at least 4 layers with at least twice as many units
as the input dimension. Critically, there is no
performance penalty for overshooting the network size
(other than increased forward-pass runtime, as we
discuss below). This suggests the DNN extension would
require less tuning than the Nystrom method to achieve
optimal approximation in new applications.

4.3 The Effect of Pretraining 

In typical machine learning scenarios, increasing the
number of model parameters for a fixed training set
size opens the method up to overfitting and poor
generalization. The trends for the DNN methods in
Figure 1 defy this conventional wisdom, with no loss in
approximation fidelity for any training corpus size as
we move to deeper and wider network architectures.
Indeed, the unsupervised pretraining is responsible
for regularizing the parameter estimates, preventing
overspecialization to even the smallest training set
considered. This can be seen most clearly in the
scatterplots in Figure 2, where each dot represents
a single model architecture. As the total number of
parameters increases, the test NRMSE of the
pretrained models decays roughly monotonically. The
same architectures without pretraining track
similarly for smaller model sizes, but, due to overfitting,
the test performance degrades as more parameters
are made available. This behavior is especially clear
in the case of limited training examples, though we
see the beginnings of the same effect for the largest
architectures even in the presence of the full training
set.

Figure 2: Approximation fidelity vs. number of DNN
parameters for (a) 1,000 and (b) 1.1 million training
set sizes, both with and without unsupervised
pretraining.
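The layer-wise pretraining of Section 3.2.1 can be sketched in plain NumPy to show what "reoptimize all layer parameters" with a fresh decoder looks like in practice. This is a toy reconstruction under assumptions, not the Pylearn2 procedure used in the experiments: the initialization scale, learning rate, epoch count, batch size, synthetic data, and function names are all illustrative, and a fixed small number of epochs stands in for early stopping.

```python
import numpy as np

def forward(layers, decoder, X):
    """Return all layer activations and the linear reconstruction."""
    acts = [X]
    for W, b in layers:
        acts.append(np.tanh(acts[-1] @ W.T + b))
    return acts, acts[-1] @ decoder[0].T + decoder[1]

def sgd_epoch(layers, decoder, X, lr, batch, rng):
    """One epoch of minibatch SGD on reconstruction MSE, updating every layer."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        xb = X[order[start:start + batch]]
        acts, out = forward(layers, decoder, xb)
        grad = 2.0 * (out - xb) / len(xb)            # d(loss)/d(output)
        delta = grad @ decoder[0]                    # backprop into h_N first
        decoder[0] -= lr * grad.T @ acts[-1]
        decoder[1] -= lr * grad.sum(axis=0)
        for l in range(len(layers) - 1, -1, -1):
            W, b = layers[l]
            pre = delta * (1.0 - acts[l + 1] ** 2)   # tanh derivative
            delta = pre @ W                          # propagate before updating W
            layers[l][0] = W - lr * pre.T @ acts[l]
            layers[l][1] = b - lr * pre.sum(axis=0)

def pretrain(X, n_layers, m_units, epochs=30, lr=0.02, batch=50, seed=0):
    rng = np.random.default_rng(seed)
    d, layers, decoder = X.shape[1], [], None
    for depth in range(n_layers):
        n_in = d if depth == 0 else m_units
        layers.append([rng.normal(0, n_in ** -0.5, (m_units, n_in)),
                       np.zeros(m_units)])
        # A fresh linear decoder replaces the one from the previous depth.
        decoder = [rng.normal(0, m_units ** -0.5, (d, m_units)), np.zeros(d)]
        for _ in range(epochs):
            sgd_epoch(layers, decoder, X, lr, batch, rng)
    return layers, decoder

def reconstruction_mse(layers, decoder, X):
    _, out = forward(layers, decoder, X)
    return np.mean(np.sum((out - X) ** 2, axis=1))

X = np.random.default_rng(0).normal(size=(500, 8))   # synthetic inputs
layers, decoder = pretrain(X, n_layers=2, m_units=16)
baseline = np.mean(np.sum(X ** 2, axis=1))           # error of predicting zeros
final = reconstruction_mse(layers, decoder, X)
print(final < baseline)
```

For supervised fine-tuning, the pretrained `layers` would be kept and the decoder retargeted from the inputs to the reference embeddings.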


4.4 Test Runtimes 

Finally, we consider the computational efficiency of
applying the two out-of-sample extension methods
to a test corpus. Table 1 lists the NRMSE values
and corresponding test times in seconds (i.e., the
time taken to extend the embedding to the entire
400,000 sample test set) for the Nystrom method
with optimal hyperparameters and two DNN
architectures. As expected, the runtimes for the
nonparametric Nystrom method increase as the training
sample gets bigger, since nearest neighbor retrieval
remains expensive even when using the k-d tree data
structure. Meanwhile, the DNN runtimes are
virtually constant for a fixed number of parameters. The
small DNN can consume the full training set and
produce extensions over 4 times faster than the Nystrom
method with the smallest training set, while
simultaneously reducing NRMSE by 20% relative.
Moreover, the best DNN roughly matches the
approximation fidelity of the best Nystrom NRMSE, and the
DNN accomplishes it ~350 times faster. For all
training sizes tested, DNN extensions can provide
significant speedup without any sacrifices in fidelity, and,
in many cases, improve both speed and fidelity.

For speech processing applications, where
interactivity is often critical, even our largest networks can
process test samples faster than real-time. Moreover,
with the large DNN consisting of 5 layers of 140
hidden units, optimal NRMSE is achieved at speeds 120
times faster than real-time using a single processor.
This is comparable to the extraction speed of
traditional acoustic front-ends in state-of-the-art
implementations (Povey et al. 2011).
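The speedup and real-time figures quoted above can be reproduced from the Table 1 entries with simple arithmetic. The calculation below assumes that the 400,000 test frames correspond to 400,000 x 10 ms of audio; the variable names are illustrative.

```python
frame_shift_s = 0.010                              # one frame every 10 ms
test_frames = 400_000

frames_per_hour = round(3600 / frame_shift_s)      # frames in one hour of audio
test_audio_s = test_frames * frame_shift_s         # seconds of test audio

# Test runtimes in seconds, taken from Table 1.
nystrom_time = {"1k": 25, "10k": 240, "100k": 2200, "1.1M": 12000}
small_dnn_time, large_dnn_time = 5.4, 33

speedup_small = nystrom_time["1k"] / small_dnn_time     # small DNN vs. fastest Nystrom
speedup_large = nystrom_time["1.1M"] / large_dnn_time   # large DNN vs. best Nystrom
rel_nrmse_drop = (0.36 - 0.29) / 0.36                   # small DNN (1.1M) vs. Nystrom (1k)
realtime_factor = test_audio_s / large_dnn_time         # audio seconds per compute second

print(frames_per_hour, round(speedup_small, 1), round(speedup_large),
      round(rel_nrmse_drop, 2), round(realtime_factor))
```

These values line up with the rounded figures in the text: over 4 times faster, roughly 350 times faster, about 20% relative NRMSE reduction, and about 120 times faster than real-time.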


5 Conclusion 

In this work, we used modern deep learning
methodologies to perform out-of-sample extensions of graph
embeddings. Compared with the standard Nystrom
sampling-based out-of-sample extension, the DNNs
approximate the embeddings with higher fidelity and
are substantially more computationally efficient.
Using unsupervised pretraining for parameter
initialization improves DNN generalization, making our DNN
approach highly stable across a wide variety of
hyperparameter settings. These results support deep
neural networks with unsupervised pretraining as an
ideal choice for out-of-sample extensions of learned
manifold representations.